feat(optimize): MCP tool coverage detector with cache-aware costing by ozymandiashh · Pull Request #223 · getagentseal/codeburn

ozymandiashh · 2026-05-05T01:13:47Z

Summary

Closes #2.

Adds a per-tool optimizer finding for MCP servers whose schema is loaded on every turn but rarely invoked. Builds on the existing server-level detectUnusedMcp (zero invocations) by reporting partial-use cases like "loaded 54 tools, called 0" or "loaded 26 tools, called 2 (8% coverage)".

Smoke-tested on a real account: 7 servers flagged across 93 sessions — office-word-mcp 0/54, notebooklm-mcp 0/38, office-ppt-mcp 0/37, excel-mcp-server 0/25, github-mcp-server 2/26, peekaboo 3/22, plus claude_ai_Asana.

Inventory source

Claude Code's JSONL writes attachment.deferred_tools_delta entries whose addedNames array lists the exact tools available at that turn — including every fully-qualified mcp__<server>__<tool> name. We union across all delta entries in a session (not just the first) because tool availability can change mid-session when MCP config reloads or a subagent inherits a different tool set.

Names that don't match the mcp__<server>__<tool> shape with both segments non-empty are rejected at extraction so downstream split('__') consumers can't be poisoned.
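A minimal sketch of the extraction rule described above. The entry shape (`attachment.type === "deferred_tools_delta"` with an `addedNames` array) is an assumption about the JSONL layout, and `parseMcpName` is a hypothetical helper, not necessarily what `src/parser.ts` does.

```typescript
// Assumed entry shape: only the fields this sketch reads are modelled.
type JsonlEntry = {
  attachment?: { type?: string; addedNames?: unknown };
};

// Accepts mcp__<server>__<tool> only when both segments are non-empty.
function parseMcpName(name: string): { server: string; tool: string } | null {
  const prefix = "mcp__";
  if (!name.startsWith(prefix)) return null;
  const rest = name.slice(prefix.length);
  const sep = rest.indexOf("__");
  if (sep <= 0) return null; // no separator, or empty server segment
  const tool = rest.slice(sep + 2);
  if (tool.length === 0) return null; // empty tool segment
  return { server: rest.slice(0, sep), tool };
}

// Unions names across every delta entry in the session, not just the first.
function extractMcpInventory(entries: JsonlEntry[]): string[] {
  const seen = new Set<string>();
  for (const entry of entries) {
    const att = entry.attachment;
    if (!att || att.type !== "deferred_tools_delta") continue;
    if (!Array.isArray(att.addedNames)) continue; // tolerate malformed attachments
    for (const name of att.addedNames) {
      if (typeof name === "string" && parseMcpName(name) !== null) seen.add(name);
    }
  }
  return Array.from(seen).sort();
}
```

Rejecting names at extraction means later code can split on `__` without re-validating each name.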

Token-savings estimation

MCP tool schemas live in the cached prefix of the system prompt:

Each cache-creation event pays the full input price (the cache is rebuilt after roughly 5 minutes of inactivity).
Subsequent turns pay the cache-read discount (~10% of input).
Each call's contribution is capped at its observed cacheCreationInputTokens / cacheReadInputTokens so we never claim more MCP overhead than the call's own cache buckets could contain.

When multiple servers are flagged, costing is a single combined pass: the per-call cap applies to the total unused-schema budget across all flagged servers, not per server. Two flagged servers can't independently claim the same call's cache bucket and overstate tokensSaved.
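The combined-cap pass can be sketched roughly as follows. This is an illustration under assumptions, not the code in `src/optimize.ts`: `tokensPerTool` is a hypothetical flat per-tool schema size, and the multipliers are passed in by the caller (the description above implies 1.0 for writes and about 0.1 for reads; a later commit in this PR moves writes to 1.25).

```typescript
interface CallUsage {
  cacheCreationInputTokens: number;
  cacheReadInputTokens: number;
}

interface CachePricing {
  writeMultiplier: number; // cache-write price relative to base input
  readMultiplier: number;  // cache-read price relative to base input
}

// Returns effective tokens (input-token equivalents) attributable to unused schemas.
function estimateUnusedSchemaCost(
  unusedToolCounts: Record<string, number>, // flagged server -> unused tool count
  calls: CallUsage[],
  tokensPerTool: number, // hypothetical flat estimate per tool schema
  pricing: CachePricing,
): number {
  // One budget for all flagged servers, so two servers cannot
  // each claim the same call's cache bucket.
  let budget = 0;
  for (const server of Object.keys(unusedToolCounts)) {
    budget += (unusedToolCounts[server] ?? 0) * tokensPerTool;
  }

  let effective = 0;
  for (const call of calls) {
    // Both buckets are checked on every call: a single call can write and read cache.
    effective += Math.min(budget, call.cacheCreationInputTokens) * pricing.writeMultiplier;
    effective += Math.min(budget, call.cacheReadInputTokens) * pricing.readMultiplier;
  }
  return effective;
}
```

Capping against the summed budget rather than per server is what keeps the estimate from exceeding what a call's cache buckets actually held.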

Correctness invariants

A session counts toward loadedSessions (and toward the cost estimate) only if its observed inventory included the server. Pure invocation-only sessions, where the server appears in mcpBreakdown or call.mcpTools without any matching deferred_tools_delta, do not satisfy the >= 2 sessions threshold on their own.
Coverage is computed against the inventory only: invocations of names not present in any observed inventory (older config, hallucinated tool, typo) do not inflate toolsInvoked and cannot drive unusedCount negative. toolsInvoked is derived as inventory.size - unusedTools.length to keep both numbers consistent.
detectUnusedMcp and the new detector are explicitly disjoint: detectUnusedMcp skips servers that the coverage detector will actually report (i.e. those clearing its thresholds), not every server that happens to be in any inventory. A small inventoried-but-uninvoked server below the coverage thresholds still gets flagged as "configured but never called."
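The coverage invariant above can be illustrated with a small sketch. `computeCoverage` and the `ToolCoverage` shape are hypothetical names for illustration; only the rule that `toolsInvoked` is derived from the inventory comes from the description.

```typescript
interface ToolCoverage {
  toolsAvailable: number;
  toolsInvoked: number;
  unusedTools: string[];
  coverage: number; // fraction in [0, 1]
}

function computeCoverage(inventoryNames: string[], invokedNames: string[]): ToolCoverage {
  const inventory = new Set<string>(inventoryNames);
  const invoked = new Set<string>(invokedNames);

  // Only inventory members are considered; invocations of unknown names are ignored.
  const unusedTools = Array.from(inventory)
    .filter((name) => !invoked.has(name))
    .sort();

  // Derived from the inventory so the two counts always sum to toolsAvailable.
  const toolsInvoked = inventory.size - unusedTools.length;

  return {
    toolsAvailable: inventory.size,
    toolsInvoked,
    unusedTools,
    coverage: inventory.size === 0 ? 0 : toolsInvoked / inventory.size,
  };
}
```

Because the unused list is a subset of the inventory, `unusedTools.length` can never exceed `toolsAvailable`, so an invocation of a stale or hallucinated name cannot push either count out of range.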

Thresholds

> 10 tools available (small servers are noise)
< 20% coverage
>= 2 sessions with observed inventory
High impact when total effective tokens >= 200_000 or >= 3 servers flagged
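As a rough sketch, the gates above could be expressed like this. The `ServerCoverage` shape and both function names are assumptions for illustration; the numeric thresholds are the ones listed.

```typescript
interface ServerCoverage {
  server: string;
  toolsAvailable: number;
  toolsInvoked: number;
  loadedSessions: number; // sessions whose observed inventory included the server
}

function clearsCoverageThresholds(s: ServerCoverage): boolean {
  if (s.toolsAvailable <= 10) return false; // small servers are noise
  if (s.loadedSessions < 2) return false;   // needs >= 2 sessions with inventory
  return s.toolsInvoked / s.toolsAvailable < 0.2; // strictly under 20% coverage
}

function isHighImpact(totalEffectiveTokens: number, flaggedServers: number): boolean {
  return totalEffectiveTokens >= 200_000 || flaggedServers >= 3;
}
```

With the smoke-test figures, `github-mcp-server` (2/26) and `peekaboo` (3/22, about 14%) both clear the coverage gate, and seven flagged servers make the finding high impact regardless of the token total.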

Changes

src/types.ts: optional mcpInventory: string[] on SessionSummary. Provider-agnostic field; currently populated only by the Claude parser.
src/parser.ts: extractMcpInventory walks all entries, validates fully-qualified names, and returns a sorted, unique list. buildSessionSummary passes it through; the field is omitted when empty so JSON exports stay clean.
src/optimize.ts: aggregateMcpCoverage, estimateMcpSchemaCost (single- and multi-server signatures), detectMcpToolCoverage. Wired into scanAndDetect. detectUnusedMcp updated to be disjoint from the new detector.
tests/mcp-coverage.test.ts: 23 cases covering aggregation, costing, combined-cap behaviour, threshold gates, invocation-only-session filtering, foreign-tool invocations, cache rebuild events, write+read on the same call, multi-server pluralisation, backward-compat single-server signature.
tests/parser-mcp-inventory.test.ts: 12 cases for the JSONL extractor including malformed name rejection (mcp__server, mcp__server__, mcp____tool) and tolerant attachment parsing.
CHANGELOG.md: entry under Unreleased / Added (CLI).

Scope notes

Claude-only. deferred_tools_delta is Claude Code-specific. The field is provider-agnostic on SessionSummary so other parsers can populate it later, but no other provider exposes the same telemetry today.
No public API change beyond the new mcpInventory optional field. All existing schemas, exports, and CLI flags are unaffected.
No dashboard panel in this PR. The optimizer is the lowest-friction surfacing path, which fits the existing waste-finding model. Open to following up with a panel if you'd like.

Test plan

npx tsc --noEmit       # 0 errors
npx vitest run         # 34 files, 462 passed (was 427 baseline + 35 new)
npm run build          # success
node dist/cli.js optimize -p week
# -> "7 MCP servers with low tool coverage" finding (High)
# -> existing "configured but never used" still flags servers below the
#    coverage thresholds (e.g. `oura` in my data)

Reviews considered

Design and implementation went through three rounds of code review (Codex GPT-5.5 high, Gemini 3.1 Pro Preview, an internal Sonnet reviewer) before this PR. Concrete findings addressed end-to-end:

Duplicate findings between the legacy and new detector
loadedSessions counted from invocation-only sessions, diluting the threshold
toolsInvoked counting tools not present in inventory
A continue after the cacheCreationInputTokens branch that skipped the same call's cacheReadInputTokens
extractMcpInventory accepting malformed names
Cache rebuilds (multiple cacheCreation events per session)
Cumulative tokensSaved over-count when multiple servers flagged share a cache bucket
Inventory-vs-breakdown semantic mismatch between aggregator and cost estimator
Blind spot in detectUnusedMcp for inventoried-but-uninvoked small servers

iamtoruk · 2026-05-05T01:37:08Z

Solid work. Clean separation between extractMcpInventory (parser), aggregateMcpCoverage (aggregation), estimateMcpSchemaCost (costing), and detectMcpToolCoverage (finding emission). Each piece is independently testable and tested.

35 new tests covering edge cases well: malformed names, invocation-only sessions, foreign tools, cache rebuilds, multi-server cap, threshold gates, pluralization.

Comments are dense but justified here. The domain (cache pricing, inventory semantics) is genuinely complex and the invariants are non-obvious.

Two things to address:

Double aggregation: aggregateMcpCoverage is called in both detectMcpToolCoverage and the updated detectUnusedMcp. Should compute once and pass the result.
The estimateMcpSchemaCost backward-compat overload accepts number | Record<string, number> as first arg and string | string[] as third. The single-server path does { [serverOrServers as string]: unusedToolCounts } which is safe only because the caller ensures the types match. TypeScript function overloads would be cleaner than runtime type checks.

Otherwise this is one of the highest quality external PRs on this repo. Nice iteration.

ozymandiashh · 2026-05-05T02:11:29Z

Thanks for the review. I addressed both cleanup points in e46b20b:

scanAndDetect now computes MCP coverage once and passes it into both MCP detectors, so we avoid the duplicate aggregateMcpCoverage(projects) pass.
estimateMcpSchemaCost now exposes typed overloads for the single-server and multi-server call shapes, with guarded normalization inside the implementation instead of relying on unsafe casts.
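For illustration, typed overloads with guarded normalization might look like the sketch below. `schemaTokens` and its `tokensPerTool` middle argument are hypothetical stand-ins; the real `estimateMcpSchemaCost` signature is not shown in this thread.

```typescript
type UnusedCounts = Record<string, number>;

// Overload signatures: callers see only these two shapes.
function schemaTokens(unused: number, tokensPerTool: number, server: string): number;
function schemaTokens(unused: UnusedCounts, tokensPerTool: number, servers: string[]): number;

// Implementation signature: not callable directly from typed code.
function schemaTokens(
  unused: number | UnusedCounts,
  tokensPerTool: number,
  serverOrServers: string | string[],
): number {
  let counts: UnusedCounts;
  if (typeof unused === "number" && typeof serverOrServers === "string") {
    counts = { [serverOrServers]: unused }; // single-server shape
  } else if (typeof unused === "object" && Array.isArray(serverOrServers)) {
    counts = {};
    for (const s of serverOrServers) counts[s] = unused[s] ?? 0; // multi-server shape
  } else {
    // Guard instead of an unsafe cast: mismatched shapes fail loudly.
    throw new TypeError("mismatched single-/multi-server arguments");
  }

  let total = 0;
  for (const s of Object.keys(counts)) total += (counts[s] ?? 0) * tokensPerTool;
  return total;
}
```

The overloads reject a mixed call such as a number with a string array at compile time, and the runtime guard covers untyped callers.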

Validated locally with:

npx tsc --noEmit
npx vitest run tests/mcp-coverage.test.ts
npx vitest run
npm run build

Please take another look when you have a chance.
Thanks for the kind words. I'm thinking about adding a confidence meter that shows how sure codeburn is about its usage costs, and maybe also a recommendation on whether to switch to API or subscription billing. What do you think? Or would you rather I tackle more of the remaining issues first?

fix(optimize): reuse mcp coverage and type schema estimator

e46b20b

iamtoruk added 2 commits May 4, 2026 20:11

Fix cache-write pricing and shell-quote server names in fix commands

735f41b

- Use 1.25x multiplier for cache-write tokens to match Anthropic's actual pricing (was incorrectly using 1x)
- Shell-quote server names in `claude mcp remove` fix text to prevent issues with unusual server names

Merge origin/main into feat/mcp-tool-coverage

5120ec6
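The shell-quoting change in 735f41b is not shown here; one common way to do it is POSIX single-quoting, sketched below. `shellQuote` and `removeCommand` are hypothetical helpers, and the safe-character shortcut is a readability choice rather than something the commit is known to do.

```typescript
// Quotes a string for a POSIX shell. Names made only of safe characters pass through.
function shellQuote(arg: string): string {
  if (/^[A-Za-z0-9_./:@-]+$/.test(arg)) return arg;
  // Close the quote, emit an escaped single quote, reopen the quote.
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

function removeCommand(server: string): string {
  return `claude mcp remove ${shellQuote(server)}`;
}

console.log(removeCommand("github-mcp-server"));
console.log(removeCommand("my server; rm -rf ~"));
```

A name containing spaces or shell metacharacters then reaches the shell as a single literal argument.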

iamtoruk merged commit 4ac8e8d into getagentseal:main May 5, 2026
3 checks passed

iamtoruk merged 4 commits into getagentseal:main from ozymandiashh:feat/mcp-tool-coverage
